Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

Authors

  • Toshiaki Nakazawa
  • Daisuke Kawahara
  • Sadao Kurohashi
Abstract

Katakana, the Japanese phonogram script used mainly for loanwords, is a troublemaker in Japanese word segmentation. Since Katakana words are heavily domain-dependent and there are many Katakana neologisms, it is almost impossible to construct and maintain a Katakana word dictionary by hand. This paper proposes an automatic segmentation method for Japanese Katakana compounds, which makes it possible to construct a precise and concise Katakana word dictionary automatically, given only a medium-sized or large Japanese corpus of some domain.

Similar resources

Enlarging the Croatian Morphological Lexicon by Automatic Lexical Acquisition from Raw Corpora

This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon and we have used in our experiments a subset of the Croatian N...

Automatic Acquisition of a Slovak Lexicon from a Raw Corpus

This paper presents an automatic methodology we used in an experiment to acquire a morphological lexicon for the Slovak language, and the lexicon we obtained. This methodology extends and refines approaches which have proven efficient, e.g., for the acquisition of French verbs or Croatian and Russian nouns, adjectives and verbs. It only relies on a raw corpus and on a morphological description ...

Automatic Construction of Japanese KATAKANA Variant List from Large Corpus

This paper presents a method to construct a Japanese KATAKANA variant list from a large corpus. Our method is useful for information retrieval, information extraction, question answering, and so on, because KATAKANA words tend to be used as “loan words” and transliteration causes several spelling variations. Our method consists of three steps. At step 1, our system collects KATAKANA words fr...

Lexicon Acquisition with and for Symbolic NLP-Systems – a Bootstrapping Approach

We present a method of applying a broad-coverage LFG grammar of German in the process of semi-automatic lexicon acquisition from corpora. The identification of corpus instances that illustrate a certain subcategorization frame uniquely is done by a comparison of the numbers of analyses the grammar assigns to the corpus instances, under the assumption of different hypothetical lexicon entries fo...

Data-driven Amharic-English Bilingual Lexicon Acquisition

This paper describes a simple approach of statistical language modelling for bilingual lexicon acquisition from Amharic-English parallel corpora. The goal is to induce a seed translation lexicon from sentence-aligned corpora. The seed translation lexicon contains matches of Amharic lexemes to weakly inflected English words. Purely statistical measures of term distribution are used as the basis ...

Publication date: 2005